Automatically Tuned FFTs for BlueGene/L's Double FPU
نویسندگان
چکیده
IBM is currently developing the new line of BlueGene/L supercomputers. The top-of-the-line installation is planned to be a 65,536 processors system featuring a peak performance of 360 Tflop/s. This system is supposed to lead the Top 500 list when being installed in 2005 at the Lawrence Livermore National Laboratory. This paper presents one of the first numerical kernels run on a prototype BlueGene/L machine. We tuned our formal vectorization approach as well as the Vienna MAP vectorizer to support BlueGene/L’s custom two-way short vector SIMD “double” floating-point unit and connected the resulting methods to the automatic performance tuning systems Spiral and Fftw. Our approach produces automatically tuned high-performance FFT kernels for BlueGene/L that are up to 45% faster than the best scalar Spiral generated code and up to 75% faster than Fftw when run on a single BlueGene/L
منابع مشابه
Vectorization Techniques for BlueGene/L’s Double FPU
This paper presents vectorization techniques tailored to meet the specifics of the twoway single-instruction multiple-data (SIMD) double-precision floating-point unit, which is a core element of the node ASICs of IBM's 360 Tflop/s supercomputer BlueGene/L. The paper focuses on the general-purpose basic-block vectorization methods provided by the Vienna MAP vectorizer. In addition, the paper int...
متن کاملVectorization techniques for the Blue Gene/L double FPU
This paper presents vectorization techniques tailored to meet the specifics of the two-way single-instruction multiple-data (SIMD) double-precision floating-point unit (FPU), which is a core element of the node application-specific integrated circuit (ASIC) chips of the IBM 360-teraflops Blue Genet/L supercomputer. This paper focuses on the general-purpose basic-block vectorization and optimiza...
متن کاملPFFT - A Massively Parallel Extension of FFTW
We present a MPI based software library for computing the fast Fourier transforms on massively parallel, distributed memory architectures. Similar to established transpose FFT algorithms, we propose a parallel FFT framework that is based on a combination of local FFTs, local data permutations and global data transpositions. This framework can be generalized to arbitrary multi-dimensional data a...
متن کاملEfficient FFTs on IRAM
Computing Fast Fourier Transforms (FFTs) is notoriously difficult on conventional general-purpose architectures because FFTs require high memory bandwidth and strided memory accesses. Since FFTs are important in signal processing, several DSPs have hardware support for performing FFTs; moreover, some DSPs are designed solely for the purpose of computing FFTs and related transforms. In this pape...
متن کاملObtaining Hardware Performance Metrics for the BlueGene/L Supercomputer
Hardware performance monitoring is the basis of modern performance analysis tools for application optimization. We are interested in providing such performance analysis tools for the new BlueGene/L supercomputer as early as possible, so that applications can be tuned for that machine. We are faced with two challenges in achieving that goal. First, the machine is still going through its final de...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2004